Sample Efficient Policy Search for Optimal Stopping Domains
Optimal stopping problems consider the question of deciding when to stop an
observation-generating process in order to maximize a return. We examine the
problem of simultaneously learning and planning in such domains, when data is
collected directly from the environment. We propose GFSE, a simple and flexible
model-free policy search method that reuses data for sample efficiency by
leveraging problem structure. We bound the sample complexity of our approach to
guarantee uniform convergence of policy value estimates, tightening existing
PAC bounds to achieve logarithmic dependence on horizon length for our setting.
We also evaluate our method against prevalent model-based and model-free approaches on 3 domains drawn from diverse fields.
Comment: To appear in IJCAI-201
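The GFSE algorithm itself is not detailed in the abstract, but the key idea it names, reusing collected data across candidate policies by exploiting the structure of optimal stopping, can be illustrated with a minimal, hypothetical sketch: a single set of simulated trajectories (here, i.i.d. offers with a per-step waiting cost, an assumed toy domain) is scored once and reused to evaluate every candidate threshold policy.

```python
import random

def simulate_offers(horizon, seed):
    """One trajectory of i.i.d. offers (a toy observation-generating process)."""
    rng = random.Random(seed)
    return [rng.uniform(0, 1) for _ in range(horizon)]

def stopped_return(offers, threshold, cost_per_step=0.05):
    """Return when stopping at the first offer >= threshold (else forced to stop at the end)."""
    for t, offer in enumerate(offers):
        if offer >= threshold:
            return offer - cost_per_step * t
    return offers[-1] - cost_per_step * (len(offers) - 1)

def evaluate_thresholds(thresholds, n_trajectories=1000, horizon=10):
    """Reuse the same trajectories to score every candidate threshold policy."""
    trajectories = [simulate_offers(horizon, seed) for seed in range(n_trajectories)]
    return {
        th: sum(stopped_return(tr, th) for tr in trajectories) / n_trajectories
        for th in thresholds
    }

values = evaluate_thresholds([0.5, 0.7, 0.9])
best = max(values, key=values.get)
```

Because a stopping policy only chooses between "stop" and "continue", each stored trajectory yields the exact return of every threshold policy on it, which is the structural property that makes this kind of data reuse sample-efficient.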
Being Optimistic to Be Conservative: Quickly Learning a CVaR Policy
While maximizing expected return is the goal in most reinforcement learning
approaches, risk-sensitive objectives such as conditional value at risk (CVaR)
are more suitable for many high-stakes applications. However, relatively little
is known about how to explore to quickly learn policies with good CVaR. In this
paper, we present the first algorithm for sample-efficient learning of
CVaR-optimal policies in Markov decision processes based on the optimism in the
face of uncertainty principle. This method relies on a novel optimistic version
of the distributional Bellman operator that moves probability mass from the
lower to the upper tail of the return distribution. We prove asymptotic
convergence and optimism of this operator for the tabular policy evaluation
case. We further demonstrate that our algorithm finds CVaR-optimal policies
substantially faster than existing baselines in several simulated environments
with discrete and continuous state spaces.
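The paper's optimistic operator is defined over distributional Bellman backups; as a minimal, hypothetical illustration of the idea it describes, moving probability mass from the lower to the upper tail of the return distribution, the sketch below computes CVaR over a discrete return distribution and shows that shifting mass upward (the `optimistic_shift` helper is an assumption, not the paper's operator) can only increase the CVaR estimate.

```python
def cvar_discrete(atoms, probs, alpha):
    """CVaR_alpha of a discrete distribution: mean of the worst alpha mass.
    `atoms` must be sorted ascending; `probs` sums to 1."""
    remaining, acc = alpha, 0.0
    for a, p in zip(atoms, probs):
        take = min(p, remaining)
        acc += take * a
        remaining -= take
        if remaining <= 1e-12:
            break
    return acc / alpha

def optimistic_shift(probs, bonus):
    """Toy optimism: move up to `bonus` probability mass from the lowest
    atom to the highest atom, inflating the lower-tail statistic."""
    new_probs = list(probs)
    moved = min(bonus, new_probs[0])
    new_probs[0] -= moved
    new_probs[-1] += moved
    return new_probs

atoms = [-1.0, 0.0, 1.0]
probs = [0.25, 0.50, 0.25]
pessimistic = cvar_discrete(atoms, probs, 0.25)                      # -1.0
optimistic = cvar_discrete(atoms, optimistic_shift(probs, 0.1), 0.25)  # -0.6
```

The optimistic estimate dominates the plain one, which is the property such an operator needs for optimism-in-the-face-of-uncertainty exploration of CVaR.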